We received the writing assignments from the soft-skills department. The data contains many null values (missing assignments) and repeated values (student details and question-related information). We dropped the unwanted columns and the missing student assignments, then used the cleaned data to generate a CSV file containing all the student information.
In [39]:
import pandas as pd
df = pd.read_csv('scan1.csv',sep=',', header=None, names=['author_label','ass_num', 'author_writing'])
# df = pd.read_csv('bow3.csv',sep=',', header=None, names=['author_label', 'author_writing'])
# Print the last 5 rows
df = df.drop('ass_num', axis=1)
df.tail()
# print(len(df['author_writing'][0].split(" ")))
Out[39]:
Check the number of data points. Our dataset contains 1028 rows and two columns.
In [40]:
print(df.shape)
Before testing, we must train the model, and that requires separate training and testing sets. So we split the data into train and test sets using the train_test_split function from sklearn.
In [41]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['author_writing'],df['author_label'],random_state=1)
print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))
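As a side note, when no test_size is specified, train_test_split holds out 25% of the rows for testing by default. A minimal sketch on hypothetical toy data (the lists below are illustrative, not from our dataset):

```python
from sklearn.model_selection import train_test_split

# Toy data: 8 samples with alternating labels
X = list(range(8))
y = [0, 1] * 4

# With the default test_size=0.25, 6 rows go to training and 2 to testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
print(len(X_tr), len(X_te))
```

Passing random_state makes the split reproducible across runs.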
To apply Naive Bayes to our dataset, we must first convert the text into numeric values, since sklearn cannot work with non-numeric data. So we build a word-frequency matrix of the dataset and apply Bayes' theorem to it, using sklearn's CountVectorizer module.
In [42]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)
# Transform testing data and return the matrix. Note we are not fitting the testing data into the CountVectorizer()
testing_data = count_vector.transform(X_test)
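To see what fit_transform actually produces, here is a minimal sketch on a hypothetical two-document corpus (not from our dataset): each row is a document, each column a vocabulary word, and each entry the word's count in that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus to illustrate the document-term frequency matrix
docs = ["the cat sat", "the cat sat on the mat"]
cv = CountVectorizer()
matrix = cv.fit_transform(docs)  # sparse matrix of word counts

print(sorted(cv.vocabulary_))  # learned vocabulary (one column per word)
print(matrix.toarray())        # one row per document, counts per word
```

This is why we only call transform (not fit_transform) on the test set: the columns must match the vocabulary learned from the training data.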
Now we apply the Naive Bayes technique to this dataset by importing the MultinomialNB module from sklearn. We use Multinomial Naive Bayes because it works well for classification with discrete features such as word counts.
In [43]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
Out[43]:
In [44]:
## Now that we have trained our model, it's time to test it on the held-out data.
predictions = naive_bayes.predict(testing_data)
The performance of our model can be assessed by computing its accuracy, precision, recall, and F1 score.
In [45]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions, average="weighted")))
print('Recall score: {}'.format(recall_score(y_test, predictions, average="weighted")))
print('F1 score: {}'.format(f1_score(y_test, predictions, average="weighted")))
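Since this is a multi-class problem, average="weighted" computes each metric per class and then averages the per-class scores weighted by each class's support (its number of true instances). A minimal sketch on hypothetical toy labels (illustrative only):

```python
from sklearn.metrics import precision_score

# Toy labels: class 'a' precision = 1/1, class 'b' precision = 2/3
y_true = ['a', 'a', 'b', 'b']
y_pred = ['a', 'b', 'b', 'b']

# Both classes have support 2, so the weighted average is
# (2 * 1.0 + 2 * 2/3) / 4 = 5/6
print(precision_score(y_true, y_pred, average="weighted"))
```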
Since the scores are quite low, we cannot simply conclude that our model is wrong; other factors can also affect the results. Below we identify some factors that affected the scores.
In [ ]: